MASC Dataset


2024-03-15 22:11 | Source: compiled from the web

The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).

All of MASC includes manually validated annotations for sentence boundaries, tokens, lemmas, and part-of-speech (POS) tags; noun and verb chunks; named entities (person, location, organization, date); Penn Treebank syntax; coreference; and discourse structure.

The MASC project has also created additional manually produced or validated annotations for portions of the sub-corpus, including full-text annotation for FrameNet frame elements and a 100K+ sentence corpus with WordNet 3.1 sense tags, of which one-tenth are also annotated for FrameNet frame elements.

Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects, including PropBank, TimeBank, Pittsburgh opinion, and several others.

Unlike most freely available corpora that include a wide variety of linguistic annotations, MASC contains a balanced selection of texts from a broad range of genres.


